Remember that text analysis is about assigning numbers to words to measure latent concepts in a text.
Before we assign numbers, it is important to be clear about the phenomenon we are trying to capture
Theorizing and exploratory data analysis play a role in conceptualization
It is important for our concept to be unbiased, accurate, and valid
Our concept also needs to be reliable
Quantitative text analyses tend to do well on reliability, but more work is often needed to demonstrate that the concepts they capture are also valid.
Dictionary methods blend qualitative and quantitative approaches in text analysis.
Rather than counting all words, dictionaries associate specific words with predefined meanings, enhancing interpretability.
Components of a Dictionary:
| Key | Values |
|---|---|
| Emotion | Happiness, Sadness, Anger, Joy |
| Finance | Investment, Budget, Debt, Savings |
| Health | Exercise, Nutrition, Medicine, Wellness |
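In Python, these components map naturally onto a built-in dict, with each key pointing to its list of values (a toy illustration, not the dictionary used later in the lecture):

```python
# Each key (a concept) maps to its values (the words that indicate it).
lexicon = {
    "Emotion": ["Happiness", "Sadness", "Anger", "Joy"],
    "Finance": ["Investment", "Budget", "Debt", "Savings"],
    "Health": ["Exercise", "Nutrition", "Medicine", "Wellness"],
}
print(lexicon["Finance"])  # ['Investment', 'Budget', 'Debt', 'Savings']
```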
A dictionary is just a list of words \((m=1,..,M)\) that are related to a common concept.
| Aggression |
|---|
| fool |
| irritated |
| stupid |
| stubborn |
| accusation |
| accuse |
| ignorant |
Applying a dictionary to a corpus of texts \((i=1,..,N)\) simply requires counting the number of times each dictionary word occurs in each text and summing these counts
Thus, the proportion of dictionary words in document \(i\) can be defined as:
\[t_{i}=\frac{\sum^{M}_{m=1} W_{im}}{N_i}\] where:
\(W_{im}\) - the number of times dictionary word \(m\) appears in text \(i\)
\(N_{i}\) - the total number of words in document \(i\)
We need to divide by the number of words because we do not want longer texts to mechanically receive higher scores.
Note
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
\[t_{i}=\frac{\sum^{M}_{m=1} W_{im}}{N_i}=\frac{1+1}{14}=0.14\]
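The counting step can be sketched in a few lines of Python. Note that the denominator depends on tokenization choices, so the exact numbers need not match the worked example:

```python
import re

# The unweighted aggression dictionary from the table above.
aggression_words = {"fool", "irritated", "stupid", "stubborn",
                    "accusation", "accuse", "ignorant"}

text = ("That statement is as barbaric as it is downright stupid; "
        "it is nothing more than an ignorant, cruel and deliberate "
        "misconception to hide behind.")

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Count the matches ("stupid" and "ignorant") and divide by document length.
matches = sum(1 for w in tokens if w in aggression_words)
t_i = matches / len(tokens)
print(matches, t_i)
```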
A slight variation of the dictionary approach is to use weights.
The weights represent how important different words are for the specific concept.
For example, stupid is more aggressive than accuse.
| Aggression | Weight |
|---|---|
| fool | 0.6 |
| irritated | 0.5 |
| stupid | 0.8 |
| stubborn | 0.4 |
| accusation | 0.3 |
| accuse | 0.3 |
| ignorant | 0.3 |
Note
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
We now adjust the formula, where \(s_{m}\) denotes the weight of word \(m\):
\[t_{i}=\frac{\sum^{M}_{m=1} s_{m} W_{im}}{N_i}=\frac{(1*0.8)+(1*0.3)}{14}=0.08\]
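The weighted variant changes only the numerator: each match contributes its weight \(s_m\) instead of 1 (same tokenization caveat as in the unweighted sketch):

```python
import re

# The weighted aggression dictionary from the table above.
weights = {"fool": 0.6, "irritated": 0.5, "stupid": 0.8, "stubborn": 0.4,
           "accusation": 0.3, "accuse": 0.3, "ignorant": 0.3}

text = ("That statement is as barbaric as it is downright stupid; "
        "it is nothing more than an ignorant, cruel and deliberate "
        "misconception to hide behind.")

tokens = re.findall(r"[a-z]+", text.lower())

# Each matched word contributes its weight; unmatched words contribute 0.
weighted_sum = sum(weights.get(w, 0.0) for w in tokens)
t_i = weighted_sum / len(tokens)
print(weighted_sum, t_i)
```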
Many applications use unweighted dictionaries, which implicitly give every word the same weight.
However, sometimes we may want to use weights to express that some words are more reflective of the concept than others.
There are many dictionaries to measure different concepts:
Applying already-existing dictionaries to new contexts may be problematic.
Note
“Applying dictionaries outside the domain for which they were developed can lead to serious errors” (Grimmer and Stewart, 2013, 268)
Note
“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”
Note
“Terrible acts of brutality and violence have been carried out against the Rohingya people.”
How does aggression within political speech compare between men and women?
This is the question asked by Hargrave and Blumenau, 2022
We can first read all the debates.
Python
person_id name ... body_word_count aggression_rating
0 10042 Mr Gerry Bermingham ... 87 1.0
1 11727 Richard Benyon ... 109 0.0
2 24938 Penny Mordaunt ... 81 0.0
[3 rows x 26 columns]
We have the following columns within the dataframe:
Index(['person_id', 'name', 'constituency', 'entered_house', 'left_house',
'date_of_birth', 'age_years', 'gender', 'house_start', 'days_in_house',
'party', 'party_short', 'year', 'parliamentary_term', 'session',
'pct_con', 'pct_lab', 'pct_ld', 'pct_other', 'margin', 'body',
'question_time', 'debate_type', 'aggressive_word_count',
'body_word_count', 'aggression_rating'],
dtype='object')
The relevant column is body:
| body |
|---|
| Does the Minister agree that if one does not provide litigants with legal aid as speedily and appropriately as possible, one builds up a backlog? Does he agree with the Lord Chancellor's circular that the way in which cases are banked up has delayed matters, just as the failure to grant expert witnesses and the failure to grant civil legal aid to most people have done? In all, there is no justice in this country because the Conservative party has sought to destroy the whole litigation system. |
| My right honourable Friend will know that there is no greater critic of the common fisheries policy than me, but I am sure he would agree that even had we not gone into it, we would probably still have a problem, because man's technical ability to harvest vast quantities from the sea has been a problem the world over. I very much hope that the White Paper contains a firm commitment to an ecosystems approach to fisheries management and that within that there is the possibility of rebalancing fishing opportunity to try to assist the smaller, more local fishing fleet and give it a fairer cut of the opportunity. |
| I congratulate my honourable Friend on his campaign. He is quite right that this is an issue that restricts growth locally. We recognise that and have introduced restricting the use of CCTV to enforce parking, grace periods for on-street parking, and have made it possible for local people to require their local councils to review parking. I draw his attention to the Great British High Street portal, which demonstrates that if local authorities reduce their parking rates they receive greater revenue. |
We can then read the aggression dictionary:
And then examine the first 50 aggression words:
Python
['inadequacy', 'stupid ', 'prejudice ', 'falsehood', 'debase', 'hypocrisy', 'reprehensible', 'wrong ', 'coward', 'sneaky ', 'scapegoats', 'neglected', 'fool', 'cruel', 'blunder ', 'embarrasment ', 'backchat', 'incapable ', 'assault ', 'absurd ', 'mess', 'idiotic ', 'accusing ', 'groan', 'shameful', 'ludicrous', 'confrontational', 'outraged ', 'archaic ', 'deplorable', 'criticise ', 'negligent', 'silliness ', 'misleading', 'assaulting ', 'annoyed ', 'nasty', 'phony ', 'slander ', 'tactic ', 'aggressive ', 'failures ', 'betrayal', 'spite ', 'bitterly ', 'scandal', 'distasteful ', 'trickery ', 'crass', 'hypocracy']
R
aggression_texts_df <- reticulate::py$aggression_texts
# Step 1: Create a frequency table of aggressive word counts
aggressive_word_freq <- as.data.frame(table(aggression_texts_df$aggressive_word_count))
# Step 2: Renaming columns
colnames(aggressive_word_freq) <- c("aggressive_word_count", "frequency")
# Step 3: Convert aggressive_word_count to numeric for correct ordering
aggressive_word_freq$aggressive_word_count <- as.numeric(as.character(aggressive_word_freq$aggressive_word_count))
# Step 4: Graphing
library("ggplot2")
ggplot(aggressive_word_freq, aes(x = aggressive_word_count, y = frequency)) +
geom_bar(stat = "identity") +
scale_x_continuous(breaks = seq(1, max(aggressive_word_freq$aggressive_word_count), by = 1)) +
labs(x = "Number of Aggressive Words", y = "Frequency", title = "Frequency of Aggressive Word Counts per Speech")+
theme_bw()
R
# Step 1: Turning the pandas dataframe into an R dataframe using reticulate
aggression_texts_df <- reticulate::py$aggression_texts
# Step 2: Create a frequency table of the aggressive word proportions
prop_aggressive_freq <- as.data.frame(table(aggression_texts_df$proportion))
# Step 3: Renaming columns
colnames(prop_aggressive_freq) <- c("aggresive_proportion", "frequency")
# Step 4: Convert aggresive_proportion to numeric for correct ordering
prop_aggressive_freq$aggresive_proportion <- as.numeric(as.character(prop_aggressive_freq$aggresive_proportion))
# Step5: Creating a histogram
library("ggplot2")
ggplot(prop_aggressive_freq, aes(x = aggresive_proportion)) +
geom_histogram(binwidth = 0.005, color = "white")+
labs(x = "Proportion of Aggressive Words relative to Total Length of Speech",
y = "Frequency", title = "Proportion of Aggressive Word Counts relative to Speech")+
theme_bw()

To measure the extent to which we have measured our texts with error, it is important to conduct validation tests.
The main concern here is whether texts are flagged for reasons that have little to do with aggression.
There may be different types of validation depending on the research context.
So, should we expect more aggression during specific times?
One good way is to see whether aggression increases during Question Time
The following line tells us how many debates there are per debate category:
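Assuming the speeches sit in a pandas DataFrame called aggression_texts with a debate_type column (as in the output shown earlier), a value_counts() call would produce those counts. Here is a sketch on stand-in data:

```python
import pandas as pd

# Stand-in for the aggression_texts DataFrame used in the lecture.
aggression_texts = pd.DataFrame({
    "debate_type": ["question_time", "legislation", "question_time",
                    "opposition day", "prime_ministers_questions"],
})

# Number of speeches per debate category.
print(aggression_texts["debate_type"].value_counts())
```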
We can also see these differences if we calculate the average proportion by debate type:
Python
# Group by 'debate_type' and calculate the mean of 'proportions'
mean_dictionary_by_debate_type = (
aggression_texts
.groupby('debate_type', as_index=False) # Group by debate_type, keep the index
.agg(mean_dictionary=('proportion', 'mean')) # Calculate mean of proportions
)
# Display the result
print(mean_dictionary_by_debate_type)
    debate_type  mean_dictionary
0 legislation 0.000459
1 opposition day 0.000485
2 prime_ministers_questions 0.000513
3 question_time 0.000613
We can also check whether these differences are statistically significant.
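The R output below comes from lm(proportion ~ debate_type, ...). An equivalent regression could be run in Python with statsmodels; this sketch uses hypothetical stand-in data rather than the lecture's dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data; in the lecture the regression runs on aggression_texts_df.
df = pd.DataFrame({
    "proportion": [0.000, 0.010, 0.020, 0.000, 0.030, 0.010],
    "debate_type": ["legislation", "legislation", "question_time",
                    "question_time", "opposition day", "opposition day"],
})

# OLS of the aggression proportion on debate-type dummies,
# mirroring lm(proportion ~ debate_type) in R.
model = smf.ols("proportion ~ debate_type", data=df).fit()
print(model.summary())
```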
R
Call:
lm(formula = proportion ~ debate_type, data = aggression_texts_df)
Residuals:
Min 1Q Median 3Q Max
-0.000613 -0.000613 -0.000513 -0.000459 0.124541
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.591e-04 4.120e-05 11.143 < 2e-16 ***
debate_typeopposition day 2.620e-05 1.135e-04 0.231 0.81740
debate_typeprime_ministers_questions 5.402e-05 5.184e-05 1.042 0.29734
debate_typequestion_time 1.534e-04 5.247e-05 2.924 0.00346 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003297 on 28641 degrees of freedom
Multiple R-squared: 0.0003416, Adjusted R-squared: 0.0002369
F-statistic: 3.263 on 3 and 28641 DF, p-value: 0.02047
Let us look at some examples, ordered by their aggression score:
| proportion | body |
|---|---|
| 0.125 | I thought that I might provoke an intervention. |
| 0.1 | Will the honourable Gentleman give way on that ridiculous point? |
| 0.076923 | The Electoral Administration Bill includes several provisions to prevent fraud in postal voting. |
| 0.071429 | What recent assessment she has made of the effectiveness of the Action Fraud helpline. |
| 0.071429 | There can be no defence against the abuse of democratic rights in any country. |
| 0.065217 | In two constituency cases I have been told, "It is not in the Serious Fraud Office's remit and the police will not look at the corporate fraud because they do not have the money." So how do we get these corporate fraud cases properly looked at? |
| 0.0625 | Order. That is unsatisfactory. No honourable Member would be misleading - perhaps misinformed, but not misleading. |
| 0.0625 | What recent representations he has received on the operation of the Rehabilitation of Offenders Act 1974. |
| 0.058824 | I hope that the honourable Gentleman will not be so foolish as to trust the Government again. |
| 0.058824 | What steps she is taking to ensure that all forms of domestic abuse are recognised and investigated. |
Human judgment can be considered the typical gold standard: would humans code aggression in the same way?
However, even human judgment could be subject to many biases including:
Asking human subjects to code speeches can also be expensive.
Large Language Models such as ChatGPT could be an easier way to cross-validate the dictionary method here.
The aggression_texts data frame includes a variable, aggression_rating, which contains the ChatGPT rating.
ChatGPT was asked to rate each speech as aggressive (1) or not aggressive (0).
The following code shows the extent to which the dictionary method variable coincides with the ChatGPT rating:
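The contingency table can be built with pd.crosstab. The sketch below uses hypothetical stand-in data; in the lecture, the boolean flag would come from the dictionary counts and the 0/1 rating from aggression_rating:

```python
import pandas as pd

# Hypothetical stand-in: dictionary_flag is True when the dictionary found
# at least one aggressive word; aggression_rating is the ChatGPT 0/1 label.
speeches = pd.DataFrame({
    "dictionary_flag": [True, False, True, False, False],
    "aggression_rating": [1, 0, 0, 1, 0],
})

contingency_table = pd.crosstab(speeches["dictionary_flag"],
                                speeches["aggression_rating"])
print(contingency_table)
```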
And now we can extract these values separately:
Python
# Extract the values as specified
false_false = contingency_table.loc[False, 0] # dictionary = False, ChatGPT = 0
true_false = contingency_table.loc[True, 0] # dictionary = True, ChatGPT = 0
false_true = contingency_table.loc[False, 1] # dictionary = False, ChatGPT = 1
true_true = contingency_table.loc[True, 1] # dictionary = True, ChatGPT = 1
There are 1,198 speeches that were categorized as aggressive by both the dictionary method and ChatGPT.
There are 17,437 speeches that were categorized as non-aggressive by both the dictionary method and ChatGPT.
We can thus add the true positives and true negatives and divide them by the total number of observations.
We can think of this calculation as a measure of accuracy.
This measures the proportion of correctly classified cases (both positives and negatives) out of the total cases.
\[Accuracy = \frac{\textrm{True Positives}+ \textrm{True Negatives}}{\textrm{Total Observations}}\]
Another benchmark is the naïve guess, which simply predicts the most frequent class in the dataset for all instances.
Applying it would entail a few steps:
Accuracy = (Number of correct predictions) / (Total number of instances)
Our model's accuracy (65%) exceeds the naïve guess (63%).
The model captures some patterns beyond the most frequent class, but the improvement is modest.
Sensitivity Definition:
Calculation: true_true/(true_true+true_false)
Sensitivity = 1198/(1198 + 608) ≈ 66.33%
The ChatGPT model identifies about 66% of the aggressive texts marked by the dictionary.
Specificity Definition:
Calculation: false_false / (false_false + false_true)
Specificity = 17437/(17437 + 9400) ≈ 64.97%
The ChatGPT model identifies about 65% of the non-aggressive texts marked by the dictionary.
1. Accuracy: The proportion of all correctly classified texts (both aggressive and non-aggressive).
\[ \text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Observations}} \]
2. Sensitivity (Recall): Measures the model’s ability to detect aggressive texts.
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} \]
3. Specificity: Measures the model’s ability to detect non-aggressive texts.
\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}} \]
4. Naïve Guess: A simple baseline where we predict the most frequent class for all texts.
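Plugging in the cell counts reported above, all four metrics can be computed directly:

```python
# Cell counts from the dictionary-vs-ChatGPT contingency table above.
true_true = 1198     # dictionary aggressive, ChatGPT aggressive
true_false = 608     # dictionary aggressive, ChatGPT non-aggressive
false_true = 9400    # dictionary non-aggressive, ChatGPT aggressive
false_false = 17437  # dictionary non-aggressive, ChatGPT non-aggressive

total = true_true + true_false + false_true + false_false

accuracy = (true_true + false_false) / total
sensitivity = true_true / (true_true + true_false)
specificity = false_false / (false_false + false_true)

# Naive guess: always predict ChatGPT's most frequent class.
naive = max(true_true + false_true, true_false + false_false) / total

print(round(accuracy, 3), round(sensitivity, 3),
      round(specificity, 3), round(naive, 3))
# 0.651 0.663 0.65 0.63
```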
An important question here: what type of error should we try to minimize?
Evaluating Classifier Performance
Our prioritization of false-negative/false-positive rates will often depend on the application
For judicial decisions, we might prefer false negatives to false positives: would you rather put an innocent person in jail or let a guilty one go free?
For COVID tests, we might be happier to accept false positives than false negatives: would you rather isolate for no reason or spread the virus unknowingly?
So, what can we say about aggression within political speeches over time?
Python
import pandas as pd
import numpy as np
from scipy.stats import t
# Assuming aggression_texts_df2 is your DataFrame
aggression_trends = (aggression_texts
.groupby(['year', 'gender'], as_index=False)
.agg(mean_aggression=('aggression_rating', 'mean'),
sd_aggression=('aggression_rating', 'std'),
n=('aggression_rating', 'size'))
)
# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends
ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
geom_line() +
geom_point() +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
labs(
title = "Aggression Rating Over Time by Gender",
x = "Year",
y = "Mean Aggression Rating"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1))+
geom_hline(yintercept = 0)+
theme_bw()

So, what can we say about aggression within political speeches over time?
Python
import pandas as pd
import numpy as np
from scipy.stats import t
# Assuming aggression_texts_df2 is your DataFrame
aggression_trends = (aggression_texts
.groupby(['year', 'gender'], as_index=False)
.agg(mean_aggression=('proportion', 'mean'),
sd_aggression=('proportion', 'std'),
n=('aggression_rating', 'size'))
)
# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
aggression_trends2 <- reticulate::py$aggression_trends
# Plot the mean aggression rating over time with confidence intervals
ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
geom_line() +
geom_point() +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
labs(
title = "Aggression Rating Over Time by Gender",
x = "Year",
y = "Mean Aggression Rating"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1))+
geom_hline(yintercept = 0)+
theme_bw()

The graph using ChatGPT is very similar to the original article:
Python
import pandas as pd
import numpy as np
from scipy.stats import t
# Assuming aggression_texts_df2 is your DataFrame
aggression_trends = (aggression_texts
.groupby(['year', 'gender'], as_index=False)
.agg(mean_aggression=('aggression_rating', 'mean'),
sd_aggression=('aggression_rating', 'std'),
n=('aggression_rating', 'size'))
)
# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
geom_line() +
geom_point() +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
labs(
title = "Aggression Rating Over Time by Gender",
x = "Year",
y = "Mean Aggression Rating"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
geom_hline(yintercept = 0)+
theme_bw()

Hargrave and Blumenau, 2022 enhance aggression measurement by adopting a more refined approach:
Limitations of Dictionary Methods: Dictionaries often struggle to accurately capture aggression in speeches due to limited contextual understanding.
Evaluating Classifier Performance: Beyond overall accuracy, it’s essential to consider within-class metrics like specificity and sensitivity for a nuanced assessment.
Benefits of Dictionary Methods: Fast and straightforward with many ready-to-use implementations, making them easy to apply.
Context Sensitivity: The validity of dictionaries depends on the context in which they were created and applied, which can limit their generalizability.
ChatGPT has been shown to be as accurate as, or even more accurate than, human coding on tasks such as identifying aggression within speeches.
See for example this article in Social Science Computer Review
Popescu (JCU): Lecture 15